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Abstract 

A mixture of factor analyzers is a semi-parametric density estimator that 
generalizes the well-known mixtures of Gaussians model by allowing each 
Gaussian in the mixture to be represented in a different lower-dimensional 
manifold. This paper presents a robust and parsimonious model selection 
algorithm for training a mixture of factor analyzers, carrying out simulta¬ 
neous clustering and locally linear, globally nonlinear dimensionality reduc¬ 
tion. Permitting different number of factors per mixture component, the 
algorithm adapts the model complexity to the data complexity. We compare 
the proposed algorithm with related automatic model selection algorithms on 
a number of benchmarks. The results indicate the effectiveness of this fast 
and robust approach in clustering, manifold learning and class-conditional 
modeling. 
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1. Introduction 


Mixture models have a widespread use in various domains of machine 
learning and signal processing for supervised, semi-supervised and unsuper¬ 
vised tasks BE). However, the model selection problem remains to be one 
of the challenges and there is a need for efficient and parsimonious automatic 
model selection methods [3J. 

Let x denote a random variable in M d . A mixture model represents the 
distribution of x as a mixture of K component distributions: 

K 

p( x ) = ^p( x \G k )p(G k ), (1) 

k =1 

where Gk correspond to components, and p ( Gk ) are the prior probabilities of 
the components, p (Gk) are also called the mixture proportions , and sum up 
to unity. The likelihood term, expressed by p (cc| Gk), can be modeled by any 
distribution. In this paper we focus on Gaussians: 

p(x\Gk) ~ Af(>fc,£ifc), ( 2 ) 

where pL k and denote the mean and covariance of the k th component 
distribution, respectively. The number of parameters in the model is primar¬ 
ily determined by the dimensionality of the covariance matrix, which scales 
quadratically with the feature dimensionality d. When this number is large, 
overfitting becomes an issue. Indeed, one of the most important problems of 
model-based clustering methods is that they are over-parametrized in high¬ 
dimensional spaces [4] • One way of keeping the number of parameters small 
is to constrain the covariance matrices to be tied (shared) across components, 
which assumes similar shaped distributions in the data space, and is typically 
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unjustified. Another approach is to assume that each covariance matrix is 
diagonal or spherical, but this means valuable correlation information will 
be discarded. 

It is possible to keep a low number of parameters for the model without 
sacrificing correlation information by adopting a factor analysis approach. 
Factor Analysis (FA) is a latent variable model, which assumes the observed 
variables are linear projections of a small number of independent factors 2 
with additive Gaussian noise: 

x = Az + u, z ~ A/"(0, 1), u ~ A/"(0, th), (3) 

where A is a d x p factor loading matrix and th is a diagonal uniquenesses 
matrix representing the common sensor noise. Subsequently, the covariance 
matrix in Eq. [2] is expressed as = A^A^ + th, effectively reducing the 
number of parameters from 0(d 2 ) to 0(dp ), with p « d. If each Gaussian 
component is expressed in a latent space, the result is a mixture of factor 
analyzers (MoFA). 

Given a set of data points, there exists Expectation-Maximization (EM) 
approaches to train MoFA models but these approaches require the 

specification of hyper-parameters like the number of clusters and the num¬ 
ber of factors per component. For the model selection problem of MoFA, 
an incremental algorithm (IMoFA) was proposed in [6], where factors and 
components were added to the mixture one by one. The model complexity 
was monitored on a separate validation set. 

In this study, we propose a fast and parsimonious model selection al¬ 
gorithm called Adaptive Mixture of Factor Analyzers (AMoFA). Similar to 
IMoFA, AMoFA is capable of adapting a mixture model to data by selecting 
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an appropriate number of components and factors per component. However, 
the proposed AMoFA algorithm deals with two shortcomings of the IMoFA 
approach: 1) Instead of relying on a validation set, AMoFA uses a Mini¬ 
mum Message Length (MML) based criterion to control model complexity, 
subsequently using more training samples in practice. 2) AMoFA is capable 
of removing factors and components from the mixture when necessary. We 
test the proposed AMoFA approach on several benchmarks, comparing its 
performance with IMoFA, with a variational Bayesian MoFA approach [7], as 
well as with the popular Gaussian mixture model selection approach based on 
MML, introduced by Figueiredo and Jain [S]. We show that the proposed ap¬ 
proach is parsimonious and robust, and especially useful for high-dimensional 
problems. 

The layout of the paper is organized as follows. In the next section we 
review related work in model selection. AMoFA algorithm is introduced 
in Section [3j Experimental results are presented in Section |4} Section [5] 
discusses our findings, and concludes with future directions. 

2. Related Work 

There are numerous studies for mixture model class selection. These 
include using information theoretical trade-offs between likelihood and model 
complexity greedy approaches [HE] and full Bayesian 

treatment of the problem mnsuisun]. A brief review of related automatic 
model selection methods is given in Table [lj a detailed treatment can be 
found in [4], Here we provide some detail on the most relevant automatic 
model selection methods that are closely related to our work. 
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Table 1: Automatic Mixture Model Selection Approaches 
Work Model Selection Approach 


Variational Bayes Incremental 

MDL Incremental 


Ghahramani Sz Beal (1999) [7] 
Pelleg & Moore (2000) fl8l 
Rasmussen (2000) US] 

Figueiredo Sc Jain (2002) [8] 
Verbeek et al. (2003) |14| 

Law et al. (2004) |19| 

Zivkovic Sz v.d. Heijden (2004) [20] 
Salah Sz Alpaydin (2004) [6] 

Shi & Xu (2006) [2ll 
Constantinopoulos et al. (2007) m 
Gomes et al. (2008) 1161 
Boutemedjet et al. (2009) m 
Gorur Sz Rasmussen (2009) |24l 
Shi et al. (2011) fI71 
Yang et al. (2012) |25l 
Iwata et al. (2012) m 
Fan Sz Bouguila (2013) [27] 

Fan Sz Bouguila (2014) |28| 

Kersten (2014) [29] 


MC for DPMM 

Both 

MML 

Decrement al 

Fixed iteration 

Incremental 

MML 

Decrement al 

MML 

Decrement al 

Cross Validation 

Incremental 

Bayesian Yin-Yang 

Both 

Variational Bayes 

Incremental 

Variational DP 

Incremental 

MML 

Decrement al 

MC for DPMM 

Both 

Bayesian Yin-Yang 

Both 

Entropy Min. 

Decrement al 

MC for DPMM 

Both 

Variational DP 

Both 

Variational Bayes 

Incremental 

MML 

Decrement al 
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In one of the most popular model selection approaches for Gaussian mix¬ 
ture models (GMMs), Figueiredo and Jain proposed to use an MML criterion 
for determining the number of components in the mixture, and shown that 
their approach is equivalent to assuming Dirichlet priors for mixture propor¬ 
tions |8j. In their method, a large number of components (typically 25-30) 
is fit to the training set, and these components are eliminated one by one. 
At each iteration, the EM algorithm is used to find a converged set of model 
parameters. The algorithm generates and stores all intermediate models, and 
selects one that optimizes the MML criterion. 

The primary drawback of this approach is the curse of dimensionality. For 
a d-dimensional problem, fitting a single full-covariance Gaussian requires 
0(d 2 ) parameters, which typically forces the algorithm to restrict its models 
to diagonal covariances in practice. We demonstrate empirically that this 
approach (unsupervised learning of finite mixture models - ULFMM) does 
not perform well in practice for problems with high dimensionality, regardless 
of its abundant use in the literature. 

Using the parsimonious factor analysis representation described in Sec¬ 
tion [lj it is possible to explore many models that are between full-covariance 
and diagonal Gaussian mixtures in their number of parameters. The resulting 
mixture of factor analysers (MoFA) can be considered as a noise-robust ver¬ 
sion of the mixtures of probabilistic principal component analysers (PPCA) 
approach [30]. Figure [l] summarizes the relations between the mixture rep¬ 
resentations in this area. 

If we assume that the latent variables of each component Q *. in a MoFA 
model is distributed unit normal (A/"(0,/)) in the latent space, the corre- 
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Figure 1: Relationship of MoFA with some well known latent variable and mixture models. 
Model parameters are given in curly brackets, n: (1 x K) component priors, p: (lxd) 
component mean, A: (p x d) factor loading matrix, 'F: (1 x d) diagonal noise variances 
(uniqueness), S: (d x d) component covariance.) K denotes the number of components, d 
the feature dimensionality, and p the subspace dimensionality with p « d. 
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sponding data in the feature space are also Gaussian distributed: 

p{X\z,Q k ) = N{v>k + AfcZ,^), (4) 

where z denotes the latent factor values. The mixture distribution of K 
factor analyzers is then given as |5j: 

A' 

p(X) = ^ / p(A|z,C/^p(z|C/ fc )p((/ fc )(iz. (5) 

fc=i ^ 

The EM algorithm is used to find maximum likelihood solutions to latent 
variable models [31], and it can be used for training a MoFA [5]. Since 
EM does not address the model selection problem, it requires the number of 
components and factors per component to be fixed beforehand. 

Ghahramani and Beal [7] have proposed a variational Bayes scheme (VB- 
MoFA) for model selection in MoFA, which allows the local dimensionality 
of components and their total number to be automatically determined. In 
this study, we use VBMoFA as one of the benchmarks. 

To alleviate the computational complexity of the variational approach, a 
greedy model selection algorithm was proposed by Salah and Alpaydm jfTj. 
This incremental approach (IMoFA) starts by fitting a single component - 
single factor model to the data and adds factors and components in each 
iteration using fast heuristic measures until a convergence criterion is met. 
The algorithm allows components to have as many factors as necessary, and 
uses a validation set to stop model adaptation, as well as to avoid over-fitting. 
This is the third algorithm we use to compare with the proposed approach, 
which we describe in detail next. 


3. Adaptive Mixtures of Factor Analyzers 


We briefly summarize the proposed adaptive mixtures of factor analyzers 
(AMoFA) algorithm first, and then describe its details. Given a dataset X 
with N data points in d dimensions, the AMoFA algorithm is initialized by 
fitting a 1-component, 1-factor mixture model. Here, the factor is initialized 
from the leading eigenvector of the covariance matrix i. e. the principal com¬ 
ponent of the data. At each subsequent step, the algorithm considers adding 
more components and factors to the mixture, running EM iterations to find 
a parametrization. During the M-step of EM, an MML criterion is used 
to determine whether any weak components should be annihilated. Apart 
from this early component annihilation, the algorithm incorporates a second 
decremental scheme. When the incremental part of the algorithm no longer 
improves the MML criterion, a downsizing component annihilation process is 
initiated and all components are eliminated one by one. Similar to ULFMM, 
each intermediate model is stored, and the algorithm outputs the one giving 
the minimum message length. Figure [2] summarizes the proposed algorithm. 

3.1. The Generalized Message Length Criterion 

To allow local factor analyzers to have independent latent dimensionality, 
the MML criterion given in Figueiredo and Jain j8j should be generalized 
accordingly to reflect the individual code length of components: 



fc:7Tfc>0 


( 6 ) 



fc:7T/ c >0 
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Figure 2: Outline of the AMoFA algorithm 


algorithm AMoFA (training set A) 
/^Initialization*/ 

[A, /i, 'I'] train a 1-component, 1-factor model 

repeat 

/*Perform a single split*/ 


x t— Select a component for splitting via Eq. (10) 
[A 1; /x l5 7T!] t— MML_EM(split x). 
actionML(l) A- ML(Ai, /r 1; vEq,7Ti) via Eq. (9) 
/*Perform a single factor addition*/ 

y A- Select a component to add a factor 
[A 2 , /i> 2 , 'h 2 ; 7t 2 ] MML_EM(add factor to y). 
actionML(2) t— ML(A 2 , /^ 2 , 'k 2 , ^ 2 ) via Eq. (9) 

/*Select the best action*/ 

z t— arg min(actionML(l),actionML(2)) 

/*Update the parameters*/ 

[A, /x, \P, 7r] <- [A z ,^ z ,'h 2 ,7r 2 ] 
until MML decrease < e 

/*Annihilation starts with k = K components*/ 

while k > 1 

/*Select the weakest component for annihilation*/ 
[Afc, /j, k , 7Tfc] t— EM(annihilate component), 
k = k - 1 

end 

/*Select l that minimizes MML criterion in Eq. (9)*/ 
return [A z ,/x z ,'P*, 717 ] 1Q 


end 













where Ck denotes the number of parameters for component k, X represents 
the dataset with N data items, 6 the model parameters, and K nz repre¬ 
sents the number of non-zero weight components. The first three terms in 
Eq. [6] comprise the code length for real valued model parameters, the fourth 
term is the model log-likelihood. We propose to include the code length for 
integer hyper parameters, namely K nz and component-wise latent dimen¬ 
sionalities {pk}, such that the encoding becomes decodable as required by 
MDL theory mm- For this purpose, we use Rissanen’s universal prior for 
integers m- 


w*(k ) = C - l 2- lo9 *\ 


(7) 


which gives the (ideal) code length 


L*{k ) = log 1 /w*(k) = log*{k ) + log c, 


( 8 ) 


where log*(k) = logk + loglogk + ... is n-fold logarithmic sum with positive 
terms, c is the normalizing sum Y^k> o 2~ l ° 9 * k that is tightly approximated 
as c = 2.865064 [12]. log*(k ) term in Eq. [8] can be computed via a recur¬ 
sive algorithm. We finally obtain L*(K nz ), the cost to encode the number 
of components, and similarly X]fc- 7 r fe >o L*(jpk), the cost to encode the local 
dimensionalities {pk} and add them to eq. ([6]) to obtain a message length 


criterion: 




fc:7T/ c >0 


+ ^ Ck 2 ~ ~ l °gp(x\ d ) 


( 9 ) 


k: 7 r / e >0 


+ L*{K nz )+ L *M 


k: 7Tfc>0 


li 





3.2. Component Splitting and Factor Addition 

Adding a new component by splitting an existing one involves two de¬ 
cisions: which component to split, and how to split it. AMoFA splits the 
component that looks least likely to a Gaussian, by looking at a multivari¬ 
ate kurtosis metric [32]. For a multinormal distribution, the multivariate 
kurtosis takes the value /3 2 ,d = d(d + 2 ), and if the underlying population 
is multivariate normal with mean /x, the sample counterpart of /? 2 ,d, namely 
& 2 , d, has an asymptotic distribution as the number of samples N goes to in¬ 
finity. Salah and Alpaydm [ 6 j adapted this metric to the mixture model by 
using a “soft count” h .*■ = E[Gj \x t \. 


7 j = {bi d ~d(d + 2)} 


8 d(d + 2 ) 


Eii h) 


0 j 


N 


h J — 

u 2,d ~ 


h l ^ 

2^i=i n j t =1 


5><[iy 






( 10 ) 


( 11 ) 


The component with greatest 7 j is selected for splitting. AMoFA runs a 
local, 2-component MoFA on the data points that fall under the component. 
To initialize the means of new components prior to MoFA fitting, we use the 
weighted sum of all eigenvectors of the local covariance matrix: w = Vi\i, 
and set pi new — n±w, where /x is the mean vector of the component to split. 

The component having the largest difference between modeled and sample 
covariance is selected for factor addition. As in IMoFA, AMoFA uses the 
residual factor addition scheme. Given a component Qj and a set of data 
points x t under it, the re-estimated points after projection to the latent 
subspace can be written as: x l - = Aj£ , [z*|cc t , Qj]. The re-estimation error 
decreases with the number of factors used in A j. The newly added column 
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in the factor loading matrix, A JjP+1 , is selected to be the principal direction 
(the eigenvector with the largest eigenvalue) of the residual vectors x/- — x*j. 
This new factor is used in bootstrapping the EM procedure. 


3.3. Component Annihilation 

In a Bayesian view, the message length criterion (eq. ([9j) ) adopted from 
Figueiredo and Jain [8] corresponds to assuming a flat prior on component 
parameters 0 k , and a Dirichlct prior on mixture proportions 71"*,: 

K K 

J'-nz x *nz 

p{ 7n, • • • , Vk) oc exp{^2 — Y log7Tk ^ = n Ck/2 - ( 12 ) 

k= 1 k =1 

Thus, in order to minimize the adopted cost in eq. (|9]), the M-step of EM is 
changed for 7 T k : 


~new 

7T t = 


max{ 0, ($2i=i kk) ~ yf } 


EjL“^{o,c (13) 

which means that all components having a soft count (Nk) smaller than half 
the number of local parameters Ck will be annihilated. This threshold enables 
the algorithm to get rid of components that do not justify their existence. 
In the special case of AMoFA, the number of parameters per component are 
defined as: 


Ck — d * {jpk + 2) + L*(pk), 


(14) 


where d is the original dataset dimensionality, pk is the local latent dimen¬ 
sionality of component k, and L*(pk ) is the code length for pk- The additive 
constant 2 inside the bracket accounts for the parameter cost of mean pi k 
and local diagonal uniquenesses matrix Finally, the localized annihila¬ 
tion condition to check at the M step of EM is simply N k < T k = Ck/ 2. 
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In AMoFA, we use an outer loop to drive the model class adaptation and 
an inner EM loop to fit a mixture of factor analyzer model with initialized 
parameters. The inner EM algorithm is an improved and more generalized 
version of ULFMM [S], where after parallel EM updates we select the weakest 
component and check Nk < T\ for annihilation, as opposed to sequential 
component update approach (using Component-wise EM - CEM 2 [33]). Any 
time during EM, automatic component annihilation may take place. When 
the incremental progress is saturated, the downsizing component annihilation 
is initiated. The MML based EM algorithm and relevant details are given 
Appendix A 


m 


4. Experiments 

4-1. Evaluation Protocol for Clustering Performance 

We compare AMoFA with two benchmark algorithms on clustering, namely 
ULFMM algorithm from [Hj^] and the IMoFA-L from [6j. 

We use the Normalized Information Distance (NID) metric for evaluating 
clustering accuracy, as it possesses several important properties; in addition 
to being a metric, it admits an analytical adjustment for chance, and allows 
normalization to [0-1] range [34]. NID is formulated as: 

_ MI(u, v) 

max{ H(u),H(v)Y 1 J 

where entropy H{u ) and the mutual information MI(u , v) for clustering are 


lr The code is available at http://www.lx.it.pt/~mtf 
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defined as follows: 


H(u) 


MI(u, v ) 


R 


i =1 
R C 


E di Ui 

N ° S W 


iV 


EEf 1 ^ 


*=1 3 =1 


atbj/N 2 


(16) 

(17) 


Here, a* is the number of samples in cluster i, n.ij is the number of samples 
falling into cluster i in clustering u and cluster j in clustering v. MI is a non¬ 
linear measure of dependence between two random variables. It quantifies 
how much information in bits the two variables share. We compute NID be¬ 
tween the ground truth and the clusterings obtained by the automatic model 
selection techniques in order to give a more precise measure of clustering than 
just the number of clusters. When there is no overlap, NID is expected to 
be close to 0; higher overlap of clusters might result in higher average NID, 
though a relative performance comparison can still be achieved. 


4-2. Experiments on Benchmark Datasets for Clustering 

We tested three algorithms, namely IMoFA-L, AMoFA and ULFMM on 
benchmark synthetic/real datasets for clustering. For maximum compara¬ 
bility with previous work, we used some synthetic dataset examples from 
Figueiredo and Jain [8], as well as from a Yang et al.’s study on automatic 
mixture model selection [25j. 

AMoFA, as opposed to IMoFA and ULFMM, does not rely on random 
initialization. In IMoFA, the first factor is randomly initialized, and in 
ULFMM the initial cluster centers are assigned to randomly selected in¬ 
stances. AMoFA initializes the first factor from the principal component of 
the dataset. Similar to residual factor addition, this scheme can be shown 
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to converge faster than random initializations. Given a dataset, a single 
simulation is sufficient to assess performance. 

Because of this deterministic property of AMoFA, we report the results 
with multiple datasets sampled from the underlying distribution, instead of 
sampling once and simulating multiple times. Unless stated otherwise, in 
the following experiments with synthetic datasets, 100 samples are drawn 
and the average results are reported. For ULFMM, we give initial number 
of clusters K max = 20 in all our simulations for clustering and use free full 
covariances. Moreover, the EM convergence threshold e is set to 10 -5 in all 
three methods. 

Example 1: 3 Separable Gaussians. To illustrate the evolution of the solution 
with AMoFA, we generated a mixture of three Gaussians having the same 
mixture proportions Hi = 7t 2 = 7t 3 = 1/3 and the same covariance matrix 
diag{ 2,0.2} with separate means = [0, — 2]', /x 2 — [0, 0]', /^. 3 = [0,2]'. We 
generate 100 samples from the underlying distribution of 900 data points. 
This synthetic example is used in [HI 125]. Figure [3] shows the evolution of 
adaptive steps of AMoFA with found clusters shown in 2-std contour plots, 
and the description length (DL) is given above each plot. 

Example 2: Overlapping Gaussians. To test the approach for finding the cor¬ 
rect number of clusters, we use a synthetic example very similar to the one 
used in pirn. Here, three of the four Gaussians overlap with the following 
generative model: 

7Ti = 7T 2 = 7T 3 = 0.3, 7r 4 = 0.1, 

Ml = M 2 = [-4 - 4]', Ai 3 = [2 2]', ai 4 = [-1 - 6]', 
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Figure 3: The evolution of AMoFA on a toy synthetic data. To keep the figure uncluttered, 
only the mixture models obtained at the end of adaptive steps are given. The initial step 
fits a single component-single factor model. The first two iterations add components 
to the mixture, and the next one add a factor. The incremental phase stops when no 
(considerable) improvement in the message length is observed. Then, the algorithm starts 
to annihilate the components, until a single component is left. As expected, the DL in the 
decremental phase is higher, since components have two factors. Finally, the algorithm 
selects the 3-component solution having the minimum DL. 


17 























AMoFA 


IMoFA 


ULFMM 




Figure 4: Overlapping Gaussians data. Left: A sample AMoFA result. The real labels 
are shown with colors and resulting AMoFA mixture model is shown with 2-std contour 
plot. Right: Histograms of number of clusters found by AMoFA, IMoFA and ULFMM 
respectively. 
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We use N = 1000 data points. As in the previous example, we generate 
100 random datasets. In figure [4] left plot, the data are illustrated with a 
sample result of AMoFA. Out of 100 simulations, the accuracy of finding 
K*=4 is 92, 56, and 33 for AMoFA, ULFMM, and IMoFA, respectively. The 
histogram in figure [| right plot shows the distribution of number of automat¬ 
ically found clusters for three methods. Average NID over 100 datasets is 
found to be 0.2549, 0.2951, and 0.3377 for AMoFA, ULFMM and IMoFA, re¬ 
spectively. A paired t-test (two tailed) on NID scores indicates that AMoFA 
performs significantly better than ULFMM with p < 10~ 5 . 
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4-3. Application to Classification: Modeling Class Conditional Densities 

We compare AMoFA with three benchmark model selection algorithms, 
namely, VBMoFA algorithm from [7], ULFMM algorithm from [8] and the 
IMoFA algorithm from [6J. As baseline, we use Mixture of Gaussians, where 
the data of each class are modeled by a single Gaussian with full (MoG-F) or 
diagonal (MoG-D) covariances. We compare the performances of the meth¬ 
ods on classification tasks (via class-conditional modeling) on nine benchmark 
datasets: The ORL face database with binary gender classification task [35], 
16-class phoneme database from LVQ package of Helsinki University of Tech¬ 
nology [36], the VISTEX texture database [6], a 6-class Japanese Phoneme 
databas^] [37], the MNIST dataset [38], and four datasets (Letter, Pendig- 
its, Opdigits, and Waveform) from UCI ML Repository [39]. Table [2] gives 
some basic statistics about the databases. Except MNIST that has an ex¬ 
plicit train and testing protocol, all experiments were carried out with 10-fold 
cross-validation. Simulations are replicated 10 times in MNIST, where we 
crop the 4 pixel padding around the images and scale them to 10 x 10 pixels 
to obtain 100-dimensional feature vectors. 

In the experiments, we trained separate mixture models for the samples 
of each class, and used maximum likelihood classification. We did not use 
informative class priors, as it would positively bias the results, and hide the 
impact of likelihood modeling. In Table [3j we provide accuracy computed 
over 10 folds, where all six approaches used the same protocol. ULFMM col¬ 
umn reports performance of ULFMM models with free diagonal covariances, 

2 Pre-processed versions of VISTEX and Japanese Phoneme datasets that are used in 
this study can be accessed from http://www.cmpe.boun.edu.tr/~kaya/jpn_vis.zip 
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Table 2: Datasets Used for Class Conditional Mixture Modeling 


Dataset 

Dimensions 

Classes 

# of Samples 

ORL 

256 

2 

400 

LVQ 

20 

16 

3858 

OPT 

60 

10 

4677 

PEN 

16 

10 

8992 

VIS 

169 

10 

3610 

WAV 

21 

3 

500 

JPN 

112 

6 

1200 

LET 

16 

26 

20000 

MNT 

100 

10 

70000 


as full covariance models invariably give poorer results. 

The best results for a dataset are shown in bold. We compared the algo¬ 
rithms with a non-parametric sign test. For each dataset, we conducted a one 
tail paired-sample t-test with a significance level of 0.05 (0.01 upon of rejec¬ 
tion of null hypothesis). Results indicate that ULFMM ranks last in all cases: 
it is consistently inferior even against the MoG-F baseline. This is because of 
the fact that after randomized initialization of clusters, ULFMM algorithm 
annihilates all illegitimate components, skipping intermediate (possibly bet¬ 
ter than initial) models. On seven datasets AMoFA attains/shares the first 
rank, and on the remaining two it ranks the second. Note that though on 
the overall AMoFA and VBMoFA have similar number of wins against each 
other, on high dimensional datasets, namely on MNIST, VISTEX, Japanese 
Phoneme and ORL, AMoFA significantly outperforms VBMoFA. 
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Table 3: Classification Performances for Class-Conditional Models. Significantly better 
results compared to the first runner up are shown in bold, where * signifies 0.05 significance 
level, while ** corresponds to 0.01 significance level. If there are multiple best performers 
without pair-wise significant difference, they are shown in bold altogether. 



IMoFA-L 0 

VBMoFA 0 

AMoFA 

ORL 

97.8 ± 1.5 

93.0 ± 2.8 

97.5 ± 1.2 

LVQ 

91.2 ± 1.9 

91.3 ± 1.9 

89.3 ± 1.6 

OPT 

91.1 ± 2.7 

95.2 ± 1.8 

93.8 ± 2.4 

PEN 

97.9 ± 0.7 

97.8 ± 0.6 

98.1 ± 0.6 

VIS 

69.3 ± 4.6 

67.1 ± 5.9 

77.2 ± 4.6** 

WAV 

80.8 ± 4.5 

85.1 ± 4.2 

85.6 ± 4.6 

JPN 

93.4 ± 2.4 

93.2 ± 3.2 

96.5 ± 2.2* 

LET 

86.6 ± 1.5 

95.2 ± 0.7 

95.1 ± 0.7 

MNT 

91.5 ± 0.2 

84.5 ± 0.1 

93.9 ± 0 


ULFMM jS] 

MoG-D 

MoG-F 

ORL 

80.0 ± 6.5 

89 ± 2.4 

90 ± 0 

LVQ 

75.4 ± 4.5 

88.1 ± 2.6 

92.1 ± 1.8 

OPT 

49.5 ± 10.2 

84.2 ± 3.1 

94.9 ± 1.7 

PEN 

89.9 ± 2.0 

84.5 ± 2.0 

97.4 ± 0.6 

VIS 

20.6 ± 3.7 

68.6 ± 3.9 

44.7 ± 12.8 

WAV 

72.1 ± 7.5 

80.9 ± 18.2 

84.8 ± 4.6 

JPN 

82.4 ± 2.1 

82.2 ± 4.9 

92.3 ± 2.3 

LET 

56.9 ± 2.8 

64.2 ± 1.2 

88.6 ± 0.9 

MNT 

64.7 ± 2.0 

78.2 ± 0 

93.7 ± 0 
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Table 4: Row Wins/Ties/Loses against Column with 0.05 Significance. 



AMoFA 

VBMoFA 

ULFMM 

MoG-D 

MoG-F 

IMoFA 

1/2/6 

2/4/3 

9/0/0 

7/2/0 

2/3/4 

AMoFA 

* 

4/3/2 

9/0/0 

7/2/0 

5/3/1 

VBMoFA 


* 

9/0/0 

7/2/0 

3/5/1 

ULFMM 



* 

0/3/6 

0/0/9 

MoG-D 




* 

1/2/6 


The results of pairwise tests at 0.05 significance level are shown in Table |4} 
We see that the adaptive MoFA algorithms dramatically outperform GMM 
based ULFMM algorithm. MoFA is capable of exploring a wider range of 
models between diagonal and full covariance with reduced parameterization. 
Among the three MoFA based algorithms, no significant difference (ct = 
0.05) was found on Pendigits dataset. AMoFA outperforms the best results 
reported so far with the VISTEX dataset. The best test set accuracy reported 
in [6j is 73.8 ±1.1 using GMMs. We attain 77.2 ± 4.6 with AMoFA. 
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5. Conclusions and Outlook 


In this study, we propose a novel and adaptive model selection approach 
for Mixtures of Factor Analyzers. Our algorithm first adds factors and com¬ 
ponents to the mixture, and then prunes excessive parameters, thus obtaining 
a parsimonious model in a very time and space efficient way. Our contribu¬ 
tions include a generalization of the adopted MML criterion to reflect local 
parameter costs, as well as local component annihilation thresholds. 

We carry out experiments on many real datasets, and the results indicate 
the superiority of the proposed method in class-conditional modeling. We 
contrast our approach with the Incremental Mo FA approach [6], Variational 
Bayesian MoFA [Tj, as well as the popular ULFMM algorithm [8]. In high 
dimensions, MoFA based automatic modeling provides significantly better 
classification results than GMM based ULFMM modeling, as it is capable of 
modeling a much wider range of models with compact parametrization. It 
also makes use of the latent dimensionality of the local manifold, thus enables 
obtaining an adaptive cost for the description length. AMoFA algorithm is 
observed to offer the best performance on higher dimensional datasets. 

The proposed algorithm does not necessitate a validation set to control 
model complexity. Thanks to the optimized MML criterion and the fast com¬ 
ponent selection measures for incremental adaptation, the algorithm is not 
only robust, but also efficient. It does not have any requirement for parameter 
tuning. Using a recursive version of ULFMM EB. it is also possible to ex¬ 
tend the proposed method for online learning. A MATLAB tool for AMoFA 
is available from http://www.cmpe.boun.edu.tr/~kaya/amofa.zip. 


23 


Appendix A. EM Algorithm for Mixture of Factor Analyzers with 


MML Criterion 


In this section, we give the MoFA EM algorithm optimizing the gener¬ 
alized MML criterion given in eq [9] This criterion is used for automatic 
annihilation of components at the M step. We provide the formulation of 
MML based EM algorithm, which is closely related to regular EM for MoFA 
model |[5]J: 


E[z\Qf t: x ] hi k Gl k [x l^k) 

e [zz'\g k , x*] = h ik (i - n k A k + - t x k )(x t - n k )'tf) 


(A.l) 

(A.2) 

-i 


a,: 


yfftiew 

" k 


TTi- 


y^hjkx'-E \z\x t ,Gk\ > ' S j (y, h jk E[zz'\xj, Q k ]j (A.3) 


N TT k 

„ N 


diag{Y h ikY - E[z\x\ G k })x u } (A.4) 


N 


Y h 

2—1 


ik 


where to keep the notation uncluttered, 2 is defined as 


z 1 


(A.5) 


Similarly, 


A k — 


Afc AC 


, Gl k = A fc (^ fc + A k \ k ) \ and 


h ik = E [Gklx*] oc p(x\ g k ) = n k Af (n k , A k A k + ty k ). (A.6) 


The above EM formulation aims to optimize the MoFA log likelihood, which 
is the logarithm of the linear combination of component likelihoods: 

N K 

p(X\z,G) = log IIE n k J\f (**; n k , A k A k + (A.7) 

i =1 k =1 

The EM Algorithm for MoFA using MML criterion is given in figure |A.5 
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Figure A.5: EM Algorithm for MoFA with MML Criterion 


Require: X data, and initialized MoFA parameter set 6 = 
{/i, A,^,vr} 

REPEAT 


E Step: compute expectations h ik , ~Ei[z\Q kl a;*], E [zz'\Q k , x*] using 


eq. (A.6), (A.l) and (A.2), respectively 


M step: compute model parameters using equations (A.3)-(A.5) 


Compute T k = C k /2 using eq. (14) 

while any component needs annihilation 

Annihilate the weakest component k having N k < T k 


Update 7r k = 7r fe / YjiX\ n h l<k< K, 

end 


new 

nz 


Compute \ogp(X\0) using eq. (A.7) 

Compute message length jC(9, X) using eq. 0 
UNTIL C(0, X) converges with e tolerance 


25 































References 


[1] P. Moerland, Mixture Models for Unsupervised and Supervised Learn¬ 
ing, Ph.D. Thesis, The Swiss Federal Inst, of Tech, at Lausanne, 2000. 

[2] G. McLachlan, D. Peel, Finite Mixture Models, New York: Wiley, 2000. 

[3] A. K. Jain, Data clustering: 50 years beyond K-means, Pattern Recog¬ 
nition Letters 31 (8) (2010) 651-666. doi : 10.1016/j .patrec. 2009. 
09.011. 

[4] C. Bouveyron, C. Brunet-Saumard, Model-based clustering of high¬ 
dimensional data: A review, Computational Statistics & Data Analysis 
71 (2014) 52-78. 

[5] Z. Ghahramani, G. E. Hinton, The em algorithm for mixtures of factor 
analyzers, Tech. Rep. CRG-TR-96-1, University of Toronto (1997). 

[6] A. A. Salah, E. Alpaydm, Incremental mixtures of factor analysers, in: 
Proc. Int. Conf. on Pattern Recognition, 2004, pp. 276-279. 

[7] Z. Ghahramani, M. J. Beal, Variational Inference for Bayesian Mixtures 
of Factor Analysers, in: NIPS, Vol. 12, 1999, pp. 449-455. 

[8] M. A. T. Figueiredo, A. K. Jain, Unsupervised Learning of Finite Mix¬ 
ture Models, IEEE Trans. Pattern Analysis and Machine Intelligence 
24 (3) (2002) 381-396. 

[9] H. Akaike, A new look at the statistical model identification, IEEE 
Trans. Automatic Control 19 (6) (1974) 716-723. 


26 



[10] G. Schwarz, Estimating the Dimension of a Model, Annals of Statistics 
6 (2) (1979) 461-464. 

[11] J. Rissanen, A Universal Prior for Integers and Estimation by MDL, 
The Annals of Statistics 11 (2) (1983) 416-431. 

[12] J. Rissanen, Information and complexity in statistical modeling, Infor¬ 
mation Science and Statistics, Springer, Dordrecht, 2007. 

[13] C. Wallace, P. Freeman, Estimation and inference by compact coding, 
Journal of Royal Statistical Society, Series B 49 (3) (1987) 240-265. 

[14] J. J. Verbeek, N. Vlassis, B. Krose, Efficient greedy learning of Gaussian 
mixture models, Neural computation 15 (2) (2003) 469-485. 

[15] C. E. Rasmussen, The Infinite Gaussian Mixture Model, in: NIPS, 
no. 11, 2000, pp. 554-560. 

[16] R. Gomes, M. Welling, P. Perona, Incremental learning of nonparametric 
bayesian mixture models, in: CVPR, 2008, pp. 1-8. 

[17] L. Shi, S. Tu, L. Xu, Learning Gaussian mixture with automatic model 
selection: A comparative study on three Bayesian related approaches, 
Frontiers of Electr. and Electronic Eng. in China 6 (2) (2011) 215-244. 

[18] D. Pelleg, A. W. Moore, X-means: Extending k-means with efficient 
estimation of the number of clusters, in: ICML, 2000, pp. 727-734. 

[19] M. Law, M. A. T. Figueiredo, A. Jain, Simultaneous feature selection 
and clustering using mixture models, IEEE Trans. Pattern Analysis and 
Machine Intelligence 26 (9) (2004) 1154-1166. 


27 



[20] Z. Zivkovic, F. van der Heijden, Recursive unsupervised learning of finite 
mixture models, IEEE Trans. Pattern Analysis and Machine Intelligence 
26 (5) (2004) 651-656. 

[21] L. Shi, L. Xu, Local factor analysis with automatic model selection: 
A comparative study and digits recognition application, in: S. Kollias, 
A. Stafylopatis, W. Duch, E. Oja (Eds.), Int. Conf. on Artificial Neural 
Networks, Vol. 4132 of Lecture Notes in Computer Science, Springer 
Berlin Heidelberg, 2006, pp. 260-269. 

[22] C. Constantinopoulos, A. Likas, Unsupervised learning of gaussian mix¬ 
tures based on variational component splitting, IEEE Trans. Neural Net¬ 
works 18 (3) (2007) 745-755. 

[23] S. Boutemedjet, N. Bouguila, D. Ziou, A Hybrid Feature Extraction Se¬ 
lection Approach for High-Dimensional Non-Gaussian Data Clustering, 
IEEE Trans. Pattern Analysis and Machine Intelligence 31 (8) (2009) 
1429-1443. 

[24] D. Gorur, C. E. Rasmussen, Nonparametric mixtures of factor analyzers, 
in: IEEE Signal Processing and Communications Applications Conf., 
2009, pp. 708-711. doi:10.1109/SIU.2009.5136494. 

[25] M.-S. Yang, C.-Y. Lai, C.-Y. Lin, A robust EM clustering algorithm 
for Gaussian mixture models, Pattern Recognition 45 (11) (2012) 3950- 
3961. 

[26] T. Iwata, D. Duvenaud, Z. Ghahramani, Warped mixtures for nonpara¬ 
metric cluster shapes (2012). arXiv: 1206.1846. 


[27] W. Fan, N. Bouguila, D. Ziou, Variational learning of finite dirichlet 
mixture models using component splitting, Neurocomputing 129 (2014) 
3 - 16. 

[28] W. Fan, N. Bouguila, Online variational learning of generalized dirichlet 
mixture models with feature selection, Neurocomputing 126 (2014) 166 
- 179. 

[29] J. Kersten, Simultaneous feature selection and Gaussian mixture model 
estimation for supervised classification problems, Pattern Recognition 
47 (8) (2014) 2582 - 2595. doi:http://dx.doi.org/10.1016/j. 
patcog.2014.02.015, 

[30] M. E. Tipping, C. M. Bishop, Mixtures of probabilistic principal com¬ 
ponent analyzers, Neural Comput. 11 (2) (1999) 443-482. 

[31] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from 
incomplete data via the em algorithm, Journal of the Royal Statistical 
Society, Series B 39 (1) (1977) 1-38. 

[32] K. V. Mardia, J. T. Kent, J. M. Bibby, Multivariate Analysis, Proba¬ 
bility and Mathematical Statistics, Academic Press, 1979. 

[33] G. Celeux, S. Chretien, F. Forbes, A. Mkhadri, A component-wise EM 
algorithm for mixtures, Journal of Computational and Graphical Statis¬ 
tics 10 (4) (2001) 697-712. 

[34] V. X. Nguyen, J. Epps, J. Bailey, Information theoretic measures for 
clusterings comparison: Variants, properties, normalization and correc- 


29 



tion for chance, Journal of Machine Learning Research 11 (2010) 2837- 
2854. 

[35] F. S. Samaria, A. C. Harter, Parameterisation of a stochastic model for 
human face identification, in: Prof. WACV, IEEE, 1994, pp. 138-142. 

[36] T. Kohonen, J. Hynninen, J. Kangas, K. Torkkola, LVQ-PAK, Helsinki 
University of Technology (1995). 

[37] F. S. Gurgen, R. Alpaydin, U. Unluakin, E. Alpaydin, Distributed and 
local neural classifiers for phoneme recognition, Pattern Recognition Let¬ 
ters 15 (10) (1994) 1111-1118. 

[38] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning 
applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 
2278-2324. 

[39] A. Frank, A. Asuncion, UCI machine learning repository (2010). 

URL http: //archive. ics . uci . edu/ml 


30 



